nhslogo CS4132 Data Analytics

The Trends of American Football by Edwin Koon¶

Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).

Table of Content (with relevant hyperlinks to sections)¶

Motivation & Background

Summary of Research Questions & Results

Dataset

Methodology

  • Data Acquisition

  • Data Cleaning

EDA

  • 1. Which teams have won the most games and which teams scored the most per game?

  • 2. Which strategies work best for scoring and winning games, and what are the most used strategies?

  • 3. Do external factors such as weather and time of year make a significant difference in performance?

  • 4. Are there any significant trends in players, for example, who has played as which position the most?

Results Findings & Conclusion

  • 1. Which teams have won the most games and which teams scored the most per game?

  • 2. Which strategies work best for scoring and winning games, and what are the most used strategies?

  • 3. Do external factors such as weather and time of year make a significant difference in performance?

  • 4. Are there any significant trends in players, for example, who has played as which position the most?

Recommendations or Further Works

References

Motivation and Background¶

Give an overview of the project, motivation, background and goals.

American football is the most popular sport in the United States, with many football events taking place every year, such as Super Bowl and NFL. Almost 100 million people tuned in to watch the 2019 Super Bowl. Its popularity is steadily growing both in the US and overseas. I aim to study the various statistics related to football leagues, such as the winrates and watch rates of different football matches, as well as the patterns in the players themselves and how they change over time, and whether these trends differ across various contexts, such as leagues and post season games.

Summary of Research Questions & Results¶

Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet — just the answer
  1. Which teams have won the most games and which teams scored the most per game?
  1. Which strategies work best for scoring and winning games, and what are the most used strategies?
  1. Do external factors such as weather and time of year make a significant difference in performance?
  1. Are there any significant trends in players, for example, who has played as which position the most?

Dataset¶

Numbered list of dataset (with downloadable links) and a brief but clear description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).
  1. http://nflsavant.com/pbp_data.php?year=2021, http://nflsavant.com/pbp_data.php?year=2020, http://nflsavant.com/pbp_data.php?year=2019, http://nflsavant.com/pbp_data.php?year=2018, http://nflsavant.com/pbp_data.php?year=2017, http://nflsavant.com/pbp_data.php?year=2016, http://nflsavant.com/pbp_data.php?year=2015, http://nflsavant.com/pbp_data.php?year=2014, http://nflsavant.com/pbp_data.php?year=2013, (NFL Play by Play data for years 2013 to 2021)

  2. http://nflsavant.com/dump/combine.csv?year=2015, http://nflsavant.com/dump/players_2013-12-12.csv, http://nflsavant.com/dump/weather_20131231.csv Player, weather and miscellaneous data

  3. https://github.com/ryurko/nflscrapR-data/tree/master/games_data https://github.com/ryurko/nflscrapR-data/tree/master/legacy_data https://github.com/ryurko/nflscrapR-data/tree/master/play_by_play_data https://github.com/ryurko/nflscrapR-data/tree/master/roster_data (Game and Roster data for years 2009 to 2019)

  4. https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Football_League/National_Football_League_team_abbreviations (List of NFL Team Abbreviations)

Methodology ¶

You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).
In [1]:
import warnings
warnings.filterwarnings('ignore')

Data Acquisition¶

Display the data which will be used in the project. The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report. Import and display each dataset in a dataframe. For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.
In [2]:
import pandas as pd
import glob
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import squarify
from pywaffle import Waffle
import joypy
import plotly.express as px
In [3]:
# Dataset 1: Combined data of all games played from 2009 to 2019 seasons
all_games = glob.glob("games_data/*/*.csv")
df_ag = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in all_games))
df_ag = df_ag.reset_index(drop=True)
df_ag.head(5)
Out[3]:
type game_id home_team away_team week season state_of_game game_url home_score away_score Unnamed: 0
0 post 2010010901 CIN NYJ 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 24.0 NaN
1 post 2010010900 DAL PHI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 34.0 14.0 NaN
2 post 2010011001 NE BAL 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 33.0 NaN
3 post 2010011000 ARI GB 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 51.0 45.0 NaN
4 post 2010011601 NO ARI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 45.0 14.0 NaN
In [4]:
# Dataset 2: Combined play by play data of all games played from 2013 to 2021
pbp = glob.glob("play_by_play_data/all/*.csv")
df_pbp = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp))
df_pbp.head(5)
Out[4]:
GameId GameDate Quarter Minute Second OffenseTeam DefenseTeam Down ToGo YardLine ... IsTwoPointConversion IsTwoPointConversionSuccessful RushDirection YardLineFixed YardLineDirection IsPenaltyAccepted PenaltyTeam IsNoPlay PenaltyType PenaltyYards
0 2013091512 2013-09-15 4 5 9 JAX LV 4 13 16 ... 0.0 0 NaN 16 OWN 1.0 OAK 0.0 OFFENSIVE HOLDING 10.0
1 2013091512 2013-09-15 4 0 36 JAX LV 2 10 65 ... 0.0 0 NaN 35 OPP 0.0 NaN 0.0 NaN 0.0
2 2013091512 2013-09-15 4 5 21 JAX LV 3 6 23 ... 0.0 0 NaN 23 OWN 0.0 NaN 0.0 NaN 0.0
3 2013091512 2013-09-15 4 5 24 JAX LV 2 6 23 ... 0.0 0 NaN 23 OWN 0.0 NaN 0.0 NaN 0.0
4 2013091512 2013-09-15 4 5 55 JAX LV 1 10 19 ... 0.0 0 NaN 19 OWN 0.0 NaN 0.0 NaN 0.0

5 rows × 45 columns

In [5]:
# Dataset 3: All post-season games data from 2009 to 2019
pbp_post = glob.glob("play_by_play_data/post_season/*.csv")
df_pbpp = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp_post))
df_pbpp.head(5)
Out[5]:
play_id game_id home_team away_team posteam posteam_type defteam side_of_field yardline_100 game_date ... penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv Unnamed: 0
0 36 2010010901 CIN NYJ CIN home NYJ NYJ 30.0 2010-01-09 ... NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN
1 61 2010010901 CIN NYJ CIN home NYJ NYJ 36.0 2010-01-09 ... NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN
2 82 2010010901 CIN NYJ CIN home NYJ NYJ 30.0 2010-01-09 ... NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN
3 104 2010010901 CIN NYJ CIN home NYJ NYJ 30.0 2010-01-09 ... NaN NaN 1 upheld NaN 0.0 0.0 0.0 0.0 NaN
4 166 2010010901 CIN NYJ NYJ away CIN NYJ 74.0 2010-01-09 ... NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0 NaN

5 rows × 257 columns

In [6]:
# Dataset 4: All pre-season games data from 2009 to 2019
pbp_pre = glob.glob("play_by_play_data/pre_season/*.csv")
df_pbppr = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp_pre))
df_pbppr.head(5)
Out[6]:
play_id game_id home_team away_team posteam posteam_type defteam side_of_field yardline_100 game_date ... penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv
0 37 2009080950 TEN BUF TEN home BUF BUF 30.0 2009-08-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
1 59 2009080950 TEN BUF TEN home BUF TEN 74.0 2009-08-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
2 83 2009080950 TEN BUF TEN home BUF TEN 64.0 2009-08-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
3 104 2009080950 TEN BUF TEN home BUF TEN 67.0 2009-08-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
4 128 2009080950 TEN BUF TEN home BUF TEN 55.0 2009-08-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0

5 rows × 256 columns

In [7]:
# Dataset 5: All  regular season games data from 2009 to 2019
pbp_reg = glob.glob("play_by_play_data/regular_season/*.csv")
df_pbpr = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp_reg))
df_pbpr.head(5)
Out[7]:
play_id game_id home_team away_team posteam posteam_type defteam side_of_field yardline_100 game_date ... penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv
0 46 2009091000 PIT TEN PIT home TEN TEN 30.0 2009-09-10 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
1 68 2009091000 PIT TEN PIT home TEN PIT 58.0 2009-09-10 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
2 92 2009091000 PIT TEN PIT home TEN PIT 53.0 2009-09-10 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
3 113 2009091000 PIT TEN PIT home TEN PIT 56.0 2009-09-10 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
4 139 2009091000 PIT TEN PIT home TEN PIT 56.0 2009-09-10 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0

5 rows × 256 columns

In [8]:
# Dataset 6: Player post-season roster data from 2009 to 2018
rost_post = glob.glob("roster_data/post_season/*.csv")
df_rp = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in rost_post))
df_rp.head(5)
Out[8]:
season season_type full_player_name abbr_player_name team position gsis_id
0 2009 post Aaron Rodgers A.Rodgers GB QB 00-0023459
1 2009 post Adrian Peterson A.Peterson MIN RB 00-0025394
2 2009 post Ahman Green A.Green GB RB 00-0006305
3 2009 post Alex Smith A.Smith PHI TE 00-0023506
4 2009 post Andre Caldwell A.Caldwell CIN WR 00-0026237
In [9]:
# Dataset 7: Player pre-season roster data from 2009 to 2019
rost_pre = glob.glob("roster_data/pre_season/*.csv")
df_rpr = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in rost_pre))
df_rpr.head(5)
Out[9]:
season season_type full_player_name abbr_player_name team position gsis_id
0 2009 pre Aaron Kelly A.Kelly ATL WR 00-0026504
1 2009 pre Aaron Walker A.Walker CLE TE 00-0022060
2 2009 pre Adam DiMichele A.DiMichele PHI QB 00-0026895
3 2009 pre Alex Brink A.Brink HOU QB 00-0026363
4 2009 pre Alex Mortensen A.Mortensen TEN QB \tGSIS ID:
In [10]:
# Dataset 8: Player regular season roster data from 2009 to 2019
rost_reg = glob.glob("roster_data/regular_season/*.csv")
df_rr = pd.concat((pd.read_csv(f) for f in rost_reg))
df_rr.head(5)
Out[10]:
season season_type full_player_name abbr_player_name team position gsis_id
0 2009 reg Aaron Brown A.Brown DET RB 00-0027129
1 2009 reg Aaron Rodgers A.Rodgers GB QB 00-0023459
2 2009 reg Aaron Stecker A.Stecker ATL RB 00-0015601
3 2009 reg Adrian Peterson A.Peterson CHI RB 00-0021306
4 2009 reg Adrian Peterson A.Peterson MIN RB 00-0025394
In [11]:
# Dataset 9: Pass, rush and receive data from 2019 to 2022
prr = pd.read_csv("misc/nfl_pass_rush_receive_raw_data.csv")
prr.head(5)
Out[11]:
game_id player_id pos player team pass_cmp pass_att pass_yds pass_td pass_int ... OT Roof Surface Temperature Humidity Wind_Speed Vegas_Line Vegas_Favorite Over_Under game_date
0 201909050chi RodgAa00 QB Aaron Rodgers GNB 18 30 203 1 0 ... False outdoors grass 65 69 10 -3.5 CHI 47.0 5/9/2019
1 201909050chi JoneAa00 RB Aaron Jones GNB 0 0 0 0 0 ... False outdoors grass 65 69 10 -3.5 CHI 47.0 5/9/2019
2 201909050chi ValdMa00 WR Marquez Valdes-Scantling GNB 0 0 0 0 0 ... False outdoors grass 65 69 10 -3.5 CHI 47.0 5/9/2019
3 201909050chi AdamDa01 WR Davante Adams GNB 0 0 0 0 0 ... False outdoors grass 65 69 10 -3.5 CHI 47.0 5/9/2019
4 201909050chi GrahJi00 TE Jimmy Graham GNB 0 0 0 0 0 ... False outdoors grass 65 69 10 -3.5 CHI 47.0 5/9/2019

5 rows × 69 columns

In [12]:
# Dataset 10: Details of players who played from 1920 to 2013
players = pd.read_csv("misc/players_2013-12-12.csv")
players.head(5)
Out[12]:
name first_name last_name birth_city birth_state birth_country birth_date college draft_team draft_round ... draft_year position height weight death_date death_city death_state death_country year_start year_end
0 Isaako Aaitui Isaako Aaitui NaN NaN NaN 1/25/1987 NaN NaN NaN ... NaN NT 4-Jun 315.0 NaN NaN NaN NaN 2013 2013
1 Faye Abbott Faye Abbott Clearport OH USA 1895-08-16 Syracuse NaN NaN ... NaN BB-FB-TB-QB-WB- 8-May 182.0 1/22/1965 Dayton OH NaN 1921 1929
2 Vince Abbott Vince Abbott London NaN England 5/31/1958 Washington NaN NaN ... NaN K 11-May 207.0 NaN NaN NaN NaN 1921 1929
3 Duke Abbruzzi Duke Abbruzzi Warren RI USA 8/3/1917 Rhode Island NaN NaN ... NaN HB-DB 10-May 175.0 12/6/1982 Newport RI NaN 1921 1929
4 Karim Abdul-Jabbar Karim Abdul-Jabbar Los Angeles CA USA 6/28/1974 UCLA Miami Dolphins 3rd ... 1996.0 RB 10-May 194.0 NaN NaN NaN NaN 1996 2000

5 rows × 21 columns

In [13]:
# Dataset 11: Weather data from 1960 to 2013
weather = pd.read_csv("misc/weather_20131231.csv")
weather.head(5)
Out[13]:
id home_team home_score away_team away_score temperature wind_chill humidity wind_mph weather date
0 196009230ram Los Angeles Rams 21 St. Louis Cardinals 43 66 NaN 78% 8.0 66 degrees- relative humidity 78%- wind 8 mph 9/23/1960
1 196009240dal Dallas Cowboys 28 Pittsburgh Steelers 35 72 NaN 80% 16.0 72 degrees- relative humidity 80%- wind 16 mph 9/24/1960
2 196009250gnb Green Bay Packers 14 Chicago Bears 17 60 NaN 76% 13.0 60 degrees- relative humidity 76%- wind 13 mph 9/25/1960
3 196009250sfo San Francisco 49ers 19 New York Giants 21 72 NaN 44% 10.0 72 degrees- relative humidity 44%- wind 10 mph 9/25/1960
4 196009250clt Baltimore Colts 20 Washington Redskins 0 62 NaN 80% 9.0 62 degrees- relative humidity 80%- wind 9 mph 9/25/1960

Data Cleaning ¶

For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any). If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.

The relevant columns are renamed in each dataset:

In [14]:
columns = ["GameId","GameDate","OffenseTeam","DefenseTeam","Down","ToGo","YardLine", "Description", "PlayType", "IsIncomplete", "IsSack", "IsTouchdown", "IsFumble", "PenaltyTeam", "PenaltyType", "PenaltyYards"]
df_pbpp = df_pbpp.drop("Unnamed: 0", axis=1)
pbp_combined_raw = pd.concat([df_pbpp, df_pbppr, df_pbpr], ignore_index=True)
pbp_combined_raw.extra_point_attempt = pbp_combined_raw.extra_point_attempt.shift(-1)
pbp_combined_raw.two_point_attempt = pbp_combined_raw.two_point_attempt.shift(-1)
pbp_combined_raw.head(5)
Out[14]:
play_id game_id home_team away_team posteam posteam_type defteam side_of_field yardline_100 game_date ... penalty_player_id penalty_player_name penalty_yards replay_or_challenge replay_or_challenge_result penalty_type defensive_two_point_attempt defensive_two_point_conv defensive_extra_point_attempt defensive_extra_point_conv
0 36 2010010901 CIN NYJ CIN home NYJ NYJ 30.0 2010-01-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
1 61 2010010901 CIN NYJ CIN home NYJ NYJ 36.0 2010-01-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
2 82 2010010901 CIN NYJ CIN home NYJ NYJ 30.0 2010-01-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0
3 104 2010010901 CIN NYJ CIN home NYJ NYJ 30.0 2010-01-09 ... NaN NaN NaN 1 upheld NaN 0.0 0.0 0.0 0.0
4 166 2010010901 CIN NYJ NYJ away CIN NYJ 74.0 2010-01-09 ... NaN NaN NaN 0 NaN NaN 0.0 0.0 0.0 0.0

5 rows × 256 columns

As the team name abbreviations are inconsistent across the datasets and the full names are not provided in some of said inconsistent datasets, each team's abbreviation is manually matched with the team name. Note that the current team name is used in case of team name changes in the data timeframe.

In [15]:
teams = {
    "ARZ": "Arizona Cardinals",
    "ARI": "Arizona Cardinals",
    "CRD": "Arizona Cardinals",
    "ATL": "Atlanta Falcons",
    "BLT": "Baltimore Ravens",
    "BAL": "Baltimore Ravens",
    "RAV": "Baltimore Ravens",
    "BUF": "Buffalo Bills",
    "CAR": "Carolina Panthers",
    "CHI": "Chicago Bears",
    "CIN": "Cincinnati Bengals",
    "CLV": "Cleveland Browns",
    "CLE": "Cleveland Browns",
    "DAL": "Dallas Cowboys",
    "DEN": "Denver Broncos",
    "DET": "Detroit Lions",
    "GB": "Green Bay Packers",
    "GNB": "Green Bay Packers",
    "HST": "Houston Texans",
    "HOU": "Houston Texans",
    "HTX": "Houston Texans",
    "IND": "Indianapolis Colts",
    "CLT": "Indianapolis Colts",
    "JAC": "Jacksonville Jaguars",
    "JAX": "Jacksonville Jaguars",
    "KC": "Kansas City Chiefs",
    "KAN": "Kansas City Chiefs",
    "LV": "Las Vegas Raiders",
    "LVR": "Las Vegas Raiders",
    "RAI": "Las Vegas Raiders",
    "OAK": "Las Vegas Raiders",
    "SD": "Los Angeles Chargers",
    "SDG": "Los Angeles Chargers",
    "LAC": "Los Angeles Chargers",
    "STL": "Los Angeles Rams",
    "LAR": "Los Angeles Rams",
    "RAM": "Los Angeles Rams",
    "LA": "Los Angeles Rams",
    "MIA": "Miami Dolphins",
    "MIN": "Minnesota Vikings",
    "NE": "New England Patriots",
    "NWE": "New England Patriots",
    "NO": "New Orleans Saints",
    "NOR": "New Orleans Saints",
    "NYG": "New York Giants",
    "NYJ": "New York Jets",
    "PHI": "Philadelphia Eagles",
    "PIT": "Pittsburgh Steelers",
    "SF": "San Francisco 49ers",
    "SFO": "San Francisco 49ers",
    "SEA": "Seattle Seahawks",
    "TAM": "Tampa Bay Buccaneers",
    "TB": "Tampa Bay Buccaneers",
    "TEN": "Tennessee Titans",
    "OTI": "Tennessee Titans",
    "WAS": "Washington Commanders",
    "APR": "American Football Conference",
    "NPR": "National Football Conference",
    "RIC": "Team Rice",
    "SAN": "Team Sanders",
    "IRV": "Team Irvin",
    "CRT": "Team Carter"
}

Redundant columns are dropped from the players dataset:

In [16]:
players = players[["name","birth_country", "draft_year", "weight", "position", "year_start", "year_end"]]
players.head(5)
Out[16]:
name birth_country draft_year weight position year_start year_end
0 Isaako Aaitui NaN NaN 315.0 NT 2013 2013
1 Faye Abbott USA NaN 182.0 BB-FB-TB-QB-WB- 1921 1929
2 Vince Abbott England NaN 207.0 K 1921 1929
3 Duke Abbruzzi USA NaN 175.0 HB-DB 1921 1929
4 Karim Abdul-Jabbar USA 1996.0 194.0 RB 1996 2000

The humidity data is converted to integer to make graphing easier.

In [17]:
def clean_humidity(string):
    return int(string.replace('%',""))

weather.humidity[weather.humidity.notna()] = weather.humidity[weather.humidity.notna()].apply(clean_humidity)
In [18]:
id_str = weather.id.str[-3:].str.upper()
weather.insert(1, "team_abbr", value=id_str)
weather.id = weather.id.str[:-3]
weather.id = pd.to_numeric(weather.id)
weather.head(5)
Out[18]:
id team_abbr home_team home_score away_team away_score temperature wind_chill humidity wind_mph weather date
0 196009230 RAM Los Angeles Rams 21 St. Louis Cardinals 43 66 NaN 78.0 8.0 66 degrees- relative humidity 78%- wind 8 mph 9/23/1960
1 196009240 DAL Dallas Cowboys 28 Pittsburgh Steelers 35 72 NaN 80.0 16.0 72 degrees- relative humidity 80%- wind 16 mph 9/24/1960
2 196009250 GNB Green Bay Packers 14 Chicago Bears 17 60 NaN 76.0 13.0 60 degrees- relative humidity 76%- wind 13 mph 9/25/1960
3 196009250 SFO San Francisco 49ers 19 New York Giants 21 72 NaN 44.0 10.0 72 degrees- relative humidity 44%- wind 10 mph 9/25/1960
4 196009250 CLT Baltimore Colts 20 Washington Redskins 0 62 NaN 80.0 9.0 62 degrees- relative humidity 80%- wind 9 mph 9/25/1960

The rosters for each season type are combined and grouped by season and season type. Data with corrupted GSIS ID values are removed.

In [19]:
rosters = pd.concat([df_rp, df_rpr, df_rr], ignore_index=True)
rosters.sort_values(by=['season', 'season_type'])
rosters = rosters[rosters.gsis_id != "\tGSIS ID: "]
rosters.head(5)
Out[19]:
season season_type full_player_name abbr_player_name team position gsis_id
0 2009 post Aaron Rodgers A.Rodgers GB QB 00-0023459
1 2009 post Adrian Peterson A.Peterson MIN RB 00-0025394
2 2009 post Ahman Green A.Green GB RB 00-0006305
3 2009 post Alex Smith A.Smith PHI TE 00-0023506
4 2009 post Andre Caldwell A.Caldwell CIN WR 00-0026237

Lastly, the position abbreviations in the roster dataset is manually replaced as the full names are not given anywhere in the dataset. This will allow us to more clearly understand which position a player is playing as later on.

In [20]:
positions = {
    "QB": "Quarterback",
    "RB": "Running Back",
    "WR": "Wide Receiver",
    "TE": "Tight End",
    "FB": "Full Back"
}
In [21]:
rosters["position"] = rosters["position"].map(positions)
rosters.head(5)
Out[21]:
season season_type full_player_name abbr_player_name team position gsis_id
0 2009 post Aaron Rodgers A.Rodgers GB Quarterback 00-0023459
1 2009 post Adrian Peterson A.Peterson MIN Running Back 00-0025394
2 2009 post Ahman Green A.Green GB Running Back 00-0006305
3 2009 post Alex Smith A.Smith PHI Tight End 00-0023506
4 2009 post Andre Caldwell A.Caldwell CIN Wide Receiver 00-0026237

EDA¶

For each research questions shortlisted, outline your methodology in answering them. Discuss interesting observations or results discovered. Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.

Question 1: Which teams have won the most games and which teams scored the most per game?¶

To find the teams' performance, we total the scores for each team.

In [22]:
team_scores = dict()
def add_scores_teams(name: str, score: float):
    try:
        team_scores[teams[name]] += score
    except KeyError:
        team_scores[teams[name]] = score
In [23]:
def update_team_scores_1(row):
    add_scores_teams(row.home_team, row.home_score)
    add_scores_teams(row.away_team, row.away_score)
    return row
In [24]:
df_ag.apply(update_team_scores_1, axis='columns')
Out[24]:
type game_id home_team away_team week season state_of_game game_url home_score away_score Unnamed: 0
0 post 2010010901 CIN NYJ 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 24.0 NaN
1 post 2010010900 DAL PHI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 34.0 14.0 NaN
2 post 2010011001 NE BAL 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 33.0 NaN
3 post 2010011000 ARI GB 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 51.0 45.0 NaN
4 post 2010011601 NO ARI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 45.0 14.0 NaN
... ... ... ... ... ... ... ... ... ... ... ...
3656 reg 2019122900 BAL PIT 17 2019 POST http://www.nfl.com/liveupdate/game-center/2019... 28.0 10.0 NaN
3657 reg 2019122907 JAX IND 17 2019 POST http://www.nfl.com/liveupdate/game-center/2019... 38.0 20.0 NaN
3658 reg 2019122913 DEN OAK 17 2019 POST http://www.nfl.com/liveupdate/game-center/2019... 16.0 15.0 NaN
3659 reg 2019122914 LA ARI 17 2019 POST http://www.nfl.com/liveupdate/game-center/2019... 31.0 24.0 NaN
3660 reg 2019122915 SEA SF 17 2019 POST http://www.nfl.com/liveupdate/game-center/2019... 21.0 26.0 NaN

3661 rows × 11 columns

We also total the number of games each team participated in.

In [25]:
team_games = dict()
def add_games_teams(name: str):
    try:
        team_games[teams[name]] += 1
    except KeyError:
        team_games[teams[name]] = 1
In [26]:
def update_team_games_1(row):
    add_games_teams(row.home_team)
    add_games_teams(row.away_team)
    return row
In [27]:
df_ag = df_ag.apply(update_team_games_1, axis='columns')
df_ag.head(5)
Out[27]:
type game_id home_team away_team week season state_of_game game_url home_score away_score Unnamed: 0
0 post 2010010901 CIN NYJ 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 24.0 NaN
1 post 2010010900 DAL PHI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 34.0 14.0 NaN
2 post 2010011001 NE BAL 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 33.0 NaN
3 post 2010011000 ARI GB 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 51.0 45.0 NaN
4 post 2010011601 NO ARI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 45.0 14.0 NaN

A team's performance is measured with more than just their total score. Thus, also totalled and included is their win count, their winrate and their win/loss ratio. This is shown below.

In [28]:
team_wins = dict()
team_draws = dict()
team_losses = dict()

def add_wins_teams(name: str):
    try:
        team_wins[teams[name]] += 1
    except KeyError:
        team_wins[teams[name]] = 1

def add_draws_teams(name: str):
    try:
        team_draws[teams[name]] += 1
    except KeyError:
        team_draws[teams[name]] = 1
        
def add_losses_teams(name: str):
    try:
        team_losses[teams[name]] += 1
    except KeyError:
        team_losses[teams[name]] = 1
In [29]:
def update_team_wins_1(row):
    if row.home_score > row.away_score:
        add_wins_teams(row.home_team)
        add_losses_teams(row.away_team)
    elif row.home_score < row.away_score:
        add_wins_teams(row.away_team)
        add_losses_teams(row.home_team)
    else:
        add_draws_teams(row.home_team)
        add_draws_teams(row.away_team)
    return row
In [30]:
df_ag = df_ag.apply(update_team_wins_1, axis='columns')
df_ag.head(5)
Out[30]:
type game_id home_team away_team week season state_of_game game_url home_score away_score Unnamed: 0
0 post 2010010901 CIN NYJ 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 24.0 NaN
1 post 2010010900 DAL PHI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 34.0 14.0 NaN
2 post 2010011001 NE BAL 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 14.0 33.0 NaN
3 post 2010011000 ARI GB 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 51.0 45.0 NaN
4 post 2010011601 NO ARI 18 2009 POST http://www.nfl.com/liveupdate/game-center/2010... 45.0 14.0 NaN

Unfortunately the data has some outliers where some teams far fewer games than all other teams. Upon further investigation these matches are played by "pro teams" that only occur a handful of times per year. These teams are removed due to irrelevance.

In [31]:
team_games = pd.Series({key: team_games[key] for key in team_games if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_scores = pd.Series({key: team_scores[key] for key in team_scores if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_wins = pd.Series({key: team_wins[key] for key in team_wins if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_draws = pd.Series({key: team_draws[key] for key in team_draws if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_losses = pd.Series({key: team_losses[key] for key in team_losses if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})

Using the remaining data, we can see the statistics for each team: their mean score, their total score, their win count, their win rate and their win:loss ratio.

In [32]:
sns.set(rc={'figure.figsize':(15,9)})
mean_score = team_scores/team_games
mean_score = mean_score.sort_values(ascending=False)
ax = sns.barplot(x=mean_score.values, y=mean_score.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.30
   ax.annotate(str(mean_score.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Mean Points Scored")
plt.show()
In [33]:
sns.set(rc={'figure.figsize':(15,9)})
team_scores = team_scores.sort_values(ascending=False)
ax = sns.barplot(x=team_scores.values, y=team_scores.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.80
   ax.annotate(str(team_scores.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Total Points Scored")
plt.show()
In [34]:
sns.set(rc={'figure.figsize':(15,9)})
team_wins = team_wins.sort_values(ascending=False)
ax = sns.barplot(x=team_wins.values, y=team_wins.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.80
   ax.annotate(str(team_wins.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Games Won")
plt.show()
In [35]:
sns.set(rc={'figure.figsize':(15,9)})
team_ratio = team_wins/team_games
team_ratio = team_ratio.sort_values(ascending=False)
ax = sns.barplot(x=team_ratio.values, y=team_ratio.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.005
   ax.annotate(str(team_ratio.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Winrate")
plt.show()
In [36]:
sns.set(rc={'figure.figsize':(15,9)})
team_wl = team_wins/team_losses
team_wl = team_wl.sort_values(ascending=False)
ax = sns.barplot(x=team_wl.values, y=team_wl.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.01
   ax.annotate(str(team_wl.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Win/Loss Ratio")
plt.show()

It is seen that the same few teams are on the top and bottom of each list, namely the New England Patriots and Green Bay Packers on top, and the Cleveland Browns on the bottom just to name a few.

Question 2: Which strategies work best for scoring and winning games, and which strategies are used the most?¶

Strategies are based on the plays done during the games.

A common question in American football is if touching down or kicking a field goal is best for scoring points. We shall now attempt to answer that question. We shall quantify that here by finding the number of points scored per games using each of these methods.

There are three different types of touchdowns: pass, rush and return. After each touchdown, which gives 6 points, a player can potentially score up to two points more. This needs to be factored in when finding how many points a touchdown is expected to score.

In [37]:
touchdown_ana = (pbp_combined_raw[pbp_combined_raw.touchdown==1])[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained","field_goal_result","extra_point_result","two_point_conv_result","touchdown","pass_touchdown","rush_touchdown","return_touchdown","extra_point_attempt","two_point_attempt", "pass_length", "pass_location", "kick_distance"]]
def td_type_fn(row):
    if row.pass_touchdown == 1:
        return "pass"
    if row.rush_touchdown == 1:
        return "rush"
    if row.return_touchdown == 1:
        return "return"
touchdown_ana["tdtype"] = touchdown_ana[["pass_touchdown","rush_touchdown","return_touchdown"]].apply(td_type_fn,
    axis=1
)
(touchdown_ana.touchdown*6+touchdown_ana.extra_point_attempt+2*touchdown_ana.two_point_attempt).value_counts()
Out[37]:
7.0    15298
6.0     1739
8.0      971
dtype: int64
In [38]:
#mean points gained from each touchdown after accounting for extra point attempts
(touchdown_ana.touchdown*6+touchdown_ana.extra_point_attempt+2*touchdown_ana.two_point_attempt).mean()
Out[38]:
6.957352287872057

From this we see that a touchdown has an expected value of around 7 points. We now calculate total points scored from touchdowns and field goals per game:

In [39]:
touchdown_ana["td_points"] = (touchdown_ana.touchdown*6+touchdown_ana.extra_point_attempt+2*touchdown_ana.two_point_attempt)
In [40]:
touchdown_ana.groupby("game_id").td_points.sum()
Out[40]:
game_id
2009080950    27.0
2009081350    14.0
2009081351    35.0
2009081352    14.0
2009081353    35.0
              ... 
2020011201    50.0
2020011900    56.0
2020011901    50.0
2020012600    71.0
2020020200    42.0
Name: td_points, Length: 3636, dtype: float64
In [41]:
# mean points earned from touchdowns per game
td_per_g = touchdown_ana.groupby("game_id").td_points.sum().mean()
td_per_g
Out[41]:
34.457645764576455
In [42]:
# field goal analysis dataframe
fg_ana = (pbp_combined_raw[pbp_combined_raw.play_type=="field_goal"])[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained","field_goal_result","extra_point_result","two_point_conv_result","touchdown","pass_touchdown","rush_touchdown","return_touchdown","extra_point_attempt","two_point_attempt", "pass_length", "pass_location", "kick_distance"]]

We find the number of field goals for each game. Only the entries with field_goal_result equal to made is a field goal.

In [43]:
fg_ana.groupby("game_id").field_goal_result.value_counts()
Out[43]:
game_id     field_goal_result
2009080950  made                 3
2009081350  made                 3
            missed               1
2009081351  made                 6
            blocked              1
                                ..
2020011201  missed               1
2020011900  made                 1
2020011901  made                 3
2020012600  made                 1
2020020200  made                 3
Name: field_goal_result, Length: 5371, dtype: int64
In [44]:
# mean points earned from field goals per game
fg_per_g = fg_ana[fg_ana.field_goal_result=="made"].groupby("game_id").field_goal_result.count().mean()*3
fg_per_g
Out[44]:
9.776371308016877
In [45]:
plt.figure(figsize=(10, 4))
plt.bar(['Touchdown', 'Field Goal'], [td_per_g, fg_per_g])

plt.xlabel("Point Scoring Method")
plt.ylabel("Mean Points Scored per Game")
plt.title("Mean Points Scored per Game (Overall)")
Out[45]:
Text(0.5, 1.0, 'Mean Points Scored per Game (Overall)')
In [46]:
td_d = touchdown_ana.copy()
fg_d = fg_ana.copy()
td_d.index = pd.to_datetime((touchdown_ana.game_id//100).astype(str))
fg_d.index = pd.to_datetime((fg_ana.game_id//100).astype(str))
td_d = td_d.drop("game_id", axis=1).groupby("game_id").td_points.sum()
fg_d = fg_d.drop("game_id", axis=1)
fg_d = fg_d[fg_d.field_goal_result=="made"].groupby("game_id").field_goal_result.count()*3
In [47]:
pd.concat([td_d,fg_d], axis=1).rename(columns={"td_points":"Touchdown", "field_goal_result":"Field Goal"}).reset_index(drop=True).plot()
plt.xlabel("Game No.")
plt.ylabel("Score per Game")
plt.title("Points Scored from Each Point Scoring Method per Game")
Out[47]:
Text(0.5, 1.0, 'Points Scored from Each Point Scoring Method per Game')

From here we can see that field goals score much lower than touchdowns per game. Thus, touchdowns are better than field goals at scoring.

To find out the optimal touchdown strategies, we compare each type of touchdown:

In [48]:
fig = plt.figure(figsize=(10,10))
plt.bar(["Pass","Rush","Return"],
        [touchdown_ana.groupby("game_id").pass_touchdown.sum().mean(),
         touchdown_ana.groupby("game_id").rush_touchdown.sum().mean(),
        touchdown_ana.groupby("game_id").return_touchdown.sum().mean()])
plt.ylabel("Mean Number")
plt.xlabel("Type of Touchdown")
plt.title("Mean Number of Touchdowns Per Game")
plt.show()

Clearly, pass touchdowns are the most common form of touchdown and are thus the most viable.

In [49]:
pass_td = touchdown_ana.groupby("pass_length").pass_touchdown.sum()
return_td = touchdown_ana.groupby("pass_length").return_touchdown.sum()
In [50]:
fig = plt.figure(figsize=(10,10))
td_bar_df = pd.DataFrame({"pass":pass_td,"return":return_td},index=["deep","short"])
td_bar_df.T.plot(kind= 'bar')

plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Length vs Overall Touchdowns by Type")
plt.show()
<Figure size 720x720 with 0 Axes>

It is clear that short passes are multiple times better than deep passes at scoring touchdowns. Pass touchdowns also seem to be more effective than return touchdowns.

It is also asked, what is the best position to score a touchdown? There are three pass locations in the dataset, namely left, right and middle.

In [51]:
touchdown_strat_df = pd.DataFrame({"Pass": touchdown_ana.groupby("pass_location").pass_touchdown.sum(), "Return": touchdown_ana.groupby("pass_location").return_touchdown.sum()})
touchdown_strat_df.T.plot(kind= 'bar')
plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Location vs Overall Touchdowns by Type")
plt.show()

Overall, the touchdown strategy can be determined by grouping touchdowns based on the type of touchdown, as well the pass length and the pass location for pass and return touchdowns, and the yards gained for rush touchdowns.

We now do the same for the field goal data, and finding the strategy of where best to kick a field goal from.

In [52]:
fig = plt.figure(figsize=(12,5))
sns.violinplot(y=pd.concat([fg_ana["kick_distance"],fg_ana["kick_distance"]], ignore_index=True),
               x=pd.concat([pd.Series(len(fg_ana["field_goal_result"])*["total"]),fg_ana["field_goal_result"]], ignore_index=True), 
               data=fg_ana);
plt.xlabel("Type of Field Goal")
plt.ylabel("Kick Distance")
plt.title("Violinplot of Kick Distance vs Field Goals Scored")
plt.show()

From this we can see that not only is the chance of making a field goal the highest at about 30 yards away from the goalpost. Both missing and blocking is substantially higher at around 50 yards from the goalpost, where the maximum number of field goal attempts are made. Thus, the best distance from the goalpost to try to score a field goal is slightly lower than the overall median, which is around 39 yards, concluding the overall field goal strategy.

Additionally, we will find out if passing, punting or running is the best method to gain yards.

In [53]:
yds_ana = pbp_combined_raw[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained", "pass_length", "pass_location", "kick_distance"]]
yds_ana = yds_ana[yds_ana.play_type.isin(["pass","run", "punt"])]
def punt_yards(row):
    if row.play_type=="punt":
        row.yards_gained = row.kick_distance
    return row
yds_ana = yds_ana.apply(punt_yards, axis="columns")
In [54]:
yds_ana1 = yds_ana[yds_ana.play_type.isin(["run","pass","punt"])]
fig = plt.figure(figsize=(15,15))
sns.violinplot(y=yds_ana1.yards_gained,
               x=yds_ana1.play_type, 
               data=yds_ana1);
plt.xlabel("Play Type")
plt.ylabel("Yards Gained")
plt.title("Violinplot of Type of Play and Yards Gained")
plt.show()                      

Thus, the best method of gaining yards is punting. However, it is noted that punts occur much less than passes or runs and as a result, players should not just punt to gain yards.

Shockingly, many passes have a yards_gained of zero. After further investigations, it seems that a pass in the dataset can either be incomplete, resulting in only an air_yards entry, or a complete pass, resulting in an air_yards entry but having zero for the yards_after_catch entry, or positive values for both air_yards and yards_after_catch. We shall now analyse this to find out more about passes.

A pass is counted as when a player throws the ball to another player (the distance is air_yards), and when the said other player runs with the ball until he loses possession (the distance is yards_after_catch).

In [55]:
pass_ana = pbp_combined_raw[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained", "air_yards", "yards_after_catch"]][pbp_combined_raw.play_type=="pass"]
pass_ana
Out[55]:
game_id home_team away_team posteam posteam_type defteam side_of_field play_type yards_gained air_yards yards_after_catch
2 2010010901 CIN NYJ CIN home NYJ NYJ pass 0.0 5.0 NaN
3 2010010901 CIN NYJ CIN home NYJ NYJ pass 4.0 4.0 0.0
5 2010010901 CIN NYJ NYJ away CIN NYJ pass 11.0 1.0 10.0
9 2010010901 CIN NYJ NYJ away CIN NYJ pass 1.0 1.0 0.0
12 2010010901 CIN NYJ CIN home NYJ NYJ pass 0.0 8.0 NaN
... ... ... ... ... ... ... ... ... ... ... ...
645292 2019122915 SEA SF SEA home SF SF pass 0.0 12.0 NaN
645295 2019122915 SEA SF SEA home SF SF pass 11.0 11.0 0.0
645298 2019122915 SEA SF SEA home SF SF pass 0.0 5.0 NaN
645299 2019122915 SEA SF SEA home SF SF pass 0.0 5.0 NaN
645300 2019122915 SEA SF SEA home SF SF pass 4.0 4.0 0.0

267239 rows × 11 columns

In [56]:
# find mean of air yards divided by yards gained
pass_ana1 = pass_ana[pass_ana.yards_gained!=0]
(pass_ana1.air_yards/pass_ana1.yards_gained).mean()
Out[56]:
0.5024565627059023
In [57]:
pass_dist = [pass_ana[pass_ana.yards_after_catch.isna()].size, # incomplete passes
 pass_ana[(pass_ana.yards_after_catch==0) & (pass_ana.air_yards>0)].size, # zero yards after catch passes
 pass_ana[(pass_ana.yards_after_catch>0) & (pass_ana.air_yards>0)].size] # positive value passes
In [58]:
fig = px.pie(values=pass_dist,
    names=["Incomplete Passes", "Zero Yards After Catch Passes", "Other Passes"],
    color_discrete_sequence=px.colors.sequential.Turbo)
fig.show()

Question 3: Do external factors such as weather and time of year make a significant difference in performance?¶

Performance can be measured in several ways, and the ones that will be analysed here are the total score, the score ratio and the score difference.

The weather data includes measurements of temperature, humidity, wind speed and wind chill. We draw a scatterplot and a boxplot for each of these measurements, with total score as the y-axis and the measurement as the x-axis.

In [59]:
weather["total_score"] = weather.home_score + weather.away_score
In [60]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["temperature"],  data=weather)
plt.show()
In [61]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["humidity"],  data=weather)
plt.show()
In [62]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["wind_mph"],  data=weather)
plt.show()
In [63]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["wind_chill"],  data=weather)
plt.show()
In [64]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["temperature"],  data=weather)
plt.show()
In [65]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["humidity"],  data=weather)
plt.show()
In [66]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_mph"],  data=weather)
plt.show()
In [67]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_chill"],  data=weather)
plt.show()

We observe that the higher the wind speed, the lower the total score. Unfortunately, the temperature, humidity and wind chill all do not seem to have an effect on teams' total score. We shall repeat this but with the winning team's goals divided by the losing team's goals.

In [68]:
weather["win_ratio"] = np.maximum(weather.home_score,weather.away_score)/np.minimum(weather.home_score,weather.away_score)
weather["win_ratio"]
Out[68]:
0        2.047619
1        1.250000
2        1.214286
3        1.105263
4             inf
           ...   
11187    1.700000
11188    1.050000
11189    2.857143
11190    2.428571
11191    2.857143
Name: win_ratio, Length: 11192, dtype: float64
In [69]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["temperature"],  data=weather)
plt.show()
In [70]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["humidity"],  data=weather)
plt.show()
In [71]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["wind_mph"],  data=weather)
plt.show()
In [72]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["wind_chill"],  data=weather)
plt.show()
In [73]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["temperature"],  data=weather)
plt.show()
In [74]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["humidity"],  data=weather)
plt.show()
In [75]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_mph"],  data=weather)
plt.show()
In [76]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_chill"],  data=weather)
plt.show()

We observe that the higher the wind speed, the lower the total score. Unfortunately, the temperature, humidity and wind chill all do not seem to have an effect on teams' total score. We shall repeat this but with the winning team's goals divided by the losing team's goals.

In [77]:
weather["win_ratio"] = np.maximum(weather.home_score,weather.away_score)/np.minimum(weather.home_score,weather.away_score)
weather["win_ratio"]
Out[77]:
0        2.047619
1        1.250000
2        1.214286
3        1.105263
4             inf
           ...   
11187    1.700000
11188    1.050000
11189    2.857143
11190    2.428571
11191    2.857143
Name: win_ratio, Length: 11192, dtype: float64
In [78]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["temperature"],  data=weather)
plt.show()
In [79]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["humidity"],  data=weather)
plt.show()
In [80]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["wind_mph"],  data=weather)
plt.show()
In [81]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["wind_chill"],  data=weather)
plt.show()

Unfortunately, the temperature, humidity, wind speed and wind chill all do not seem to have an effect on teams' win ratio. We shall repeat this but with the difference in scores.

In [82]:
weather["score_diff"] = np.maximum(weather.home_score,weather.away_score)-np.minimum(weather.home_score,weather.away_score)
weather["score_diff"]
Out[82]:
0        22
1         7
2         3
3         2
4        20
         ..
11187    14
11188     1
11189    13
11190    20
11191    13
Name: score_diff, Length: 11192, dtype: int64
In [83]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["temperature"],  data=weather)
plt.show()
In [84]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["humidity"],  data=weather)
plt.show()
In [85]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["wind_mph"],  data=weather)
plt.show()
In [86]:
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["wind_chill"],  data=weather)
plt.show()
In [87]:
weather["score_diff"] = np.maximum(weather.home_score,weather.away_score)-np.minimum(weather.home_score,weather.away_score)
weather["score_diff"]
Out[87]:
0        22
1         7
2         3
3         2
4        20
         ..
11187    14
11188     1
11189    13
11190    20
11191    13
Name: score_diff, Length: 11192, dtype: int64
In [88]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["temperature"],  data=weather)
plt.show()
In [89]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["humidity"],  data=weather)
plt.show()
In [90]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["wind_mph"],  data=weather)
plt.show()
In [91]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["wind_chill"],  data=weather)
plt.show()

Unfortunately, the temperature, humidity, wind speed and wind chill all do not seem to have an effect on teams' score difference. We can thus conclude that the only weather factor that has a correlation with teams' performance is wind speed on total score.

In [92]:
def doy(y,m,d):
    dom = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
    total = 0
    if ((y%4==0 and y%100!=0) or y%400==0) and m >= 3:
        total += 1
    total += np.sum(dom[:(m-1)])
    total += d
    return total

def ds2010(y,m,d):
    total = doy(y,m,d)
    for i in range(2010, y):
        if (y%4==0 and y%100!=0) or y%400==0:
            total += 366
        else:
            total += 365
    return total
In [93]:
ag_date = pd.to_datetime((df_ag.game_id//100).astype(str))
df_ag["total_score"] = df_ag.home_score + df_ag.away_score
df_ag.insert(2,"date", ag_date)
In [94]:
games_by_date = df_ag.set_index("date")

fig, ax = plt.subplots(2,1,figsize=(15,10))
score_means = df_ag.groupby("date").total_score.mean()
score_means.plot(ax=ax[0])
score_means_rolling = df_ag.groupby("date").total_score.rolling(3).mean().reset_index(1,drop=True)
score_means_rolling.plot(ax=ax[1])
Out[94]:
<AxesSubplot:xlabel='date'>
In [95]:
srs = score_means_rolling[score_means_rolling.index.year == 2009]
srs.reset_index(drop=True)
Out[95]:
0            NaN
1            NaN
2            NaN
3      35.000000
4      31.333333
         ...    
300    41.666667
301    41.333333
302    42.333333
303    39.333333
304          NaN
Name: total_score, Length: 305, dtype: float64
In [96]:
fig, ax = plt.subplots(6,2)
for yr in range(2009,2021):
    srs = score_means[score_means.index.year == yr]
    srs = srs.reset_index(drop=True)
    srs.plot(ax=ax[(yr-2009)//2,1-yr%2])

Question 4: Are there any significant trends in players, for example, who has played as which position the most?¶

The unique positions are shown below:

In [97]:
rosters.position.unique()
Out[97]:
array(['Quarterback', 'Running Back', 'Tight End', 'Wide Receiver',
       'Full Back'], dtype=object)

We first match the team name abbreviations to the names of the teams using the dictionary from earlier:

In [98]:
rosters.team = rosters.team.map(teams)
rosters.head(5)
Out[98]:
season season_type full_player_name abbr_player_name team position gsis_id
0 2009 post Aaron Rodgers A.Rodgers Green Bay Packers Quarterback 00-0023459
1 2009 post Adrian Peterson A.Peterson Minnesota Vikings Running Back 00-0025394
2 2009 post Ahman Green A.Green Green Bay Packers Running Back 00-0006305
3 2009 post Alex Smith A.Smith Philadelphia Eagles Tight End 00-0023506
4 2009 post Andre Caldwell A.Caldwell Cincinnati Bengals Wide Receiver 00-0026237

Then, we see which position in which team has been played the most.

In [99]:
# most played team and position
rosters.groupby(["team","position"]).full_player_name.nunique().idxmax()
Out[99]:
('Detroit Lions', 'Wide Receiver')
In [100]:
# number of times the max team and position has been played
rosters.groupby(["team","position"]).full_player_name.nunique().max()
Out[100]:
71
In [101]:
# all said players
rosters[(rosters.team == "Detroit Lions") & (rosters.position == "Wide Receiver")].full_player_name.unique()
Out[101]:
array(['Calvin Johnson', 'Nate Burleson', 'Titus Young', 'Corey Fuller',
       'Golden Tate', 'Jeremy Ross', 'Anquan Boldin', 'Marvin Jones',
       'T.J. Jones', 'Billy McMullen', 'Bobby Sippio', 'D.J. Boldin',
       'Dane Looker', 'Keary Colbert', 'Brian Clark', 'Bryant Johnson',
       'Dennis Northcutt', 'Derrick Williams', 'Eric Fowler', 'Tim Toone',
       'Demario Ballard', 'Dominique Barnes', 'Marcus Harris',
       'Maurice Stovall', 'Nate Hughes', 'Rashied Davis',
       'Dominique Curry', 'Jarett Dillard', 'Lance Long', 'Pat Edwards',
       'Ryan Broyles', 'Stefan Logan', 'Terrence Toliver',
       'Wallace Miles', 'Cody Wilson', 'Kris Durham', 'Matt Willis',
       'Mike Thomas', 'Terrence Austin', 'Andrew Peacock',
       'Quintin Payton', 'Jarred Haggins', 'Lance Moore',
       'Vernon Johnson', 'Andre Roberts', 'Jace Billingsley', 'Jay Lee',
       'Quinshad Davis', 'Dontez Ford', 'Jared Abbrederis',
       'Kenny Golladay', 'Keshawn Martin', 'Michael Rector',
       'Noel Thomas', 'Bradley Marquez', 'Brandon Powell', 'Brian Brown',
       'Chris Lacy', 'Andy Jones', 'Danny Amendola', 'Deontez Alexander',
       'Jonathan Duhart', 'Jordan Lasley', 'Tom Kennedy',
       'Tommylee Lewis', 'Travis Fulgham', 'John Standeford',
       'Brian Robiskie', 'Dorin Dickerson', 'Kevin Ogletree',
       'Bruce Ellington'], dtype=object)

We'll find which player has been in the most number of different teams.

In [102]:
rosters[rosters.gsis_id == rosters.groupby(["gsis_id"]).team.nunique().idxmax()]
Out[102]:
season season_type full_player_name abbr_player_name team position gsis_id
3171 2012 pre Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
3839 2013 pre Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
4512 2014 pre Terrelle Pryor T.Pryor Seattle Seahawks Quarterback 00-0028825
5804 2016 pre Terrelle Pryor T.Pryor Cleveland Browns Wide Receiver 00-0028825
6482 2017 pre Terrelle Pryor T.Pryor Washington Commanders Wide Receiver 00-0028825
7228 2018 pre Terrelle Pryor T.Pryor New York Jets Wide Receiver 00-0028825
7952 2019 pre Terrelle Pryor T.Pryor Jacksonville Jaguars Wide Receiver 00-0028825
10049 2012 reg Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
10564 2013 reg Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
11609 2015 reg Terrelle Pryor T.Pryor Cleveland Browns Wide Receiver 00-0028825
12152 2016 reg Terrelle Pryor T.Pryor Cleveland Browns Wide Receiver 00-0028825
12684 2017 reg Terrelle Pryor T.Pryor Washington Commanders Wide Receiver 00-0028825
13229 2018 reg Terrelle Pryor T.Pryor Buffalo Bills Wide Receiver 00-0028825

We note that the player who played for the most different teams is Terelle Pryor, who has played for seven different teams from 2012 to 2019.

Now, let's find which player has played in the most number of positions.

In [103]:
# find the maximum number of positions a single player has been in
rosters.groupby("gsis_id").position.nunique().max()
Out[103]:
2
In [104]:
position_df = rosters.groupby("gsis_id").position.nunique().to_frame()
position_df = position_df[position_df.position == position_df.position.max()].rename(columns={"position":"pos_count"})

It seems that multiple players have played in 2 positions before. We will now find their names below.

In [105]:
max_positions = pd.merge(rosters[["full_player_name","gsis_id","position"]],
                 position_df,
                 on='gsis_id',
                 how='inner')
In [106]:
with np.printoptions(threshold=np.inf):
    print(max_positions.full_player_name.unique())
['Danny Woodhead' 'Dexter McCluster' 'Jason Snelling' 'Michael Robinson'
 'James Casey' 'Lousaka Polite' 'Jacob Hester' 'Joe Webb' 'Mike Tolbert'
 'Robert Hughes' 'Jamize Olawale' 'Ty Montgomery' 'Zach Zenner' 'C.J. Ham'
 'Logan Thomas' 'Zach Line' 'Brock Bolen' 'Dorin Dickerson'
 'Earnest Graham' 'Ernest Wilford' 'Greg Jones' 'Jamie McCoy'
 'Jason Pociask' 'John Conner' 'Lex Hilliard' 'Manase Tonga'
 'Montell Owens' 'Peyton Hillis' 'Tim Castille' 'Frank Summers'
 'Mike Sellers' 'Niles Paul' 'Stanley Havili' 'Emil Igwenagu'
 'Terrelle Pryor' 'Gerell Robinson' 'Ifeanyi Momah' 'Jason Schepler'
 'Jeremy Stewart' 'Patrick DiMarco' 'Anthony Sherman' "De'Anthony Thomas"
 'Marcus Lucas' 'Daniel Brown' 'Darren Waller' 'Neal Sterling'
 'Byron Marshall' 'Moritz Boehringer' 'Shane Smith' 'Vince Mayle'
 'Bug Howard']
In [107]:
max_positions.groupby("full_player_name").position.unique().apply(set).value_counts()
Out[107]:
{Running Back, Full Back}        24
{Wide Receiver, Tight End}       12
{Wide Receiver, Running Back}     5
{Tight End, Full Back}            5
{Tight End, Running Back}         2
{Wide Receiver, Quarterback}      2
{Tight End, Quarterback}          1
Name: position, dtype: int64

It seems that most players who double their roles do so as running backs and full backs, followed by wide receivers and tight ends.

We now find the trends in the details of the players themselves.

In [108]:
sns.violinplot(players.year_end-players.year_start)
plt.xlabel("Years Played")
plt.title("Violinplot of Number of Years Played by Players")
Out[108]:
Text(0.5, 1.0, 'Violinplot of Number of Years Played by Players')

We observe that the median number of years played is 2, and the maximum is 21.

In [109]:
sns.boxplot(x=players.year_start, y=players.weight)
Out[109]:
<AxesSubplot:xlabel='year_start', ylabel='weight'>
In [110]:
sns.boxplot(x=players.year_end-players.year_start, y=players.weight)
plt.xlabel("Years Played")
Out[110]:
Text(0.5, 0, 'Years Played')

There does not seem to be a correlation between the number of years a player has played or when they started playing and their weight.

Results Findings & Conclusion¶

For each research question, summarize in 2-3 visualizations which will answer the question. Intrepret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.

Question 1: Which teams have won the most games and which teams scored the most per game?¶

In [111]:
sns.set(rc={'figure.figsize':(15,9)})
mean_score = team_scores/team_games
mean_score = mean_score.sort_values(ascending=False)
ax = sns.barplot(x=mean_score.values, y=mean_score.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.30
   ax.annotate(str(mean_score.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Mean Points Scored")
plt.show()
In [112]:
sns.set(rc={'figure.figsize':(15,9)})
team_scores = team_scores.sort_values(ascending=False)
ax = sns.barplot(x=team_scores.values, y=team_scores.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.80
   ax.annotate(str(team_scores.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Total Points Scored")
plt.show()
In [113]:
sns.set(rc={'figure.figsize':(15,9)})
team_wins = team_wins.sort_values(ascending=False)
ax = sns.barplot(x=team_wins.values, y=team_wins.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.80
   ax.annotate(str(team_wins.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Games Won")
plt.show()
In [114]:
sns.set(rc={'figure.figsize':(15,9)})
team_ratio = team_wins/team_games
team_ratio = team_ratio.sort_values(ascending=False)
ax = sns.barplot(x=team_ratio.values, y=team_ratio.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.005
   ax.annotate(str(team_ratio.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Winrate")
plt.show()
In [115]:
sns.set(rc={'figure.figsize':(15,9)})
team_wl = team_wins/team_losses
team_wl = team_wl.sort_values(ascending=False)
ax = sns.barplot(x=team_wl.values, y=team_wl.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
   y = patches[i].get_y() + patches[i].get_height()-0.05
   x = patches[i].get_width() + 0.01
   ax.annotate(str(team_wl.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Win/Loss Ratio")
plt.show()

The New England Patriots and Green Bay Packers have the first and third highest results respectively in all the graphs. Baltimore Ravens has the second highest result in 3 graphs and New Orleans Saints has the second highest result in 2 graphs. On the other hand, the Jacksonville Jaguars and Cleveland Browns are consistently the bottom two teams. Thus, we can conclude that the New England Patriots is the best team overall, having both won the most games and scored the most per game.

Question 2. Which strategies work best for scoring and winning games, and what are the most used strategies?¶

The strategy for scoring field goals is shown below:

In [116]:
fig = plt.figure(figsize=(12,5))
sns.violinplot(y=pd.concat([fg_ana["kick_distance"],fg_ana["kick_distance"]], ignore_index=True),
               x=pd.concat([pd.Series(len(fg_ana["field_goal_result"])*["total"]),fg_ana["field_goal_result"]], ignore_index=True), 
               data=fg_ana);
plt.xlabel("Type of Field Goal")
plt.ylabel("Kick Distance")
plt.title("Violinplot of Kick Distance vs Total Field Goals Scored")
plt.show()

From this we can see that not only is the chance of making a field goal the highest at about 30 yards away from the goalpost. Both missing and blocking is substantially higher at around 50 yards from the goalpost, where the maximum number of field goal attempts are made. Thus, the best distance from the goalpost to try to score a field goal is slightly lower than the overall median, which is around 39 yards, concluding the field goal strategy.

In [117]:
fig = plt.figure(
    FigureClass=Waffle, 
    rows=10, 
    columns=40, 
    values=[td_per_g, fg_per_g],
    figsize=(12, 8),  # figsize is a parameter of matplotlib.pyplot.figure
    labels = ["Touchdown points per game", "Field Goal points per game"],
    legend={
        'loc': 'lower left',
        'bbox_to_anchor': (0, -0.2),
        'framealpha': 0,
        'fontsize': 12
        }
)
plt.title("Average Touchdown Points vs Field Goal Points scored per game")
plt.show()

The mean points scored per game overall is higher for touchdowns than for field goals.

In [118]:
plt.figure(figsize=(10, 4))
plt.bar(['Touchdown', 'Field Goal'], [td_per_g, fg_per_g])

plt.xlabel("Point Scoring Method")
plt.ylabel("Mean Points Scored per Game")
plt.title("Mean Points Scored per Game (Overall)")
Out[118]:
Text(0.5, 1.0, 'Mean Points Scored per Game (Overall)')

Below, we can see that field goals score much lower than touchdowns per game. Combined with the above bar chart, touchdowns are better than field goals at scoring.

In [119]:
pd.concat([td_d,fg_d], axis=1).rename(columns={"td_points":"Touchdown", "field_goal_result":"Field Goal"}).reset_index(drop=True).plot()
plt.xlabel("Game No.")
plt.ylabel("Score per Game")
plt.title("Points Scored from Each Point Scoring Method per Game")
Out[119]:
Text(0.5, 1.0, 'Points Scored from Each Point Scoring Method per Game')

To find out the optimal touchdown strategies, we compare each type of touchdown:

In [120]:
fig = plt.figure(figsize=(10,10))
plt.bar(["Pass","Rush","Return"],
        [touchdown_ana.groupby("game_id").pass_touchdown.sum().mean(),
         touchdown_ana.groupby("game_id").rush_touchdown.sum().mean(),
        touchdown_ana.groupby("game_id").return_touchdown.sum().mean()])
plt.ylabel("Mean Number")
plt.xlabel("Type of Touchdown")
plt.title("Mean Number of Touchdowns Per Game")
plt.show()

Pass touchdowns are the most common form of touchdown.

In [121]:
fig = plt.figure(figsize=(10,10))
td_bar_df = pd.DataFrame({"pass":pass_td,"return":return_td},index=["deep","short"])
td_bar_df.T.plot(kind= 'bar')

plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Length vs Overall Touchdowns by Type")
plt.show()
<Figure size 720x720 with 0 Axes>

It is clear that short passes are multiple times better than deep passes at scoring touchdowns. Pass touchdowns also seem to be more effective than return touchdowns.

The right side of the field is the best place to score a touchdown, and the middle of the field is the worst.

In [122]:
touchdown_strat_df = pd.DataFrame({"Pass": touchdown_ana.groupby("pass_location").pass_touchdown.sum(), "Return": touchdown_ana.groupby("pass_location").return_touchdown.sum()})
touchdown_strat_df.T.plot(kind= 'bar')
plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Location vs Overall Touchdowns by Type")
plt.show()

Overall, the touchdown strategy can be determined by grouping touchdowns based on the type of touchdown, as well the pass length and the pass location for pass and return touchdowns, and the yards gained for rush touchdowns.

Teh strategy of where best to kick a field goal from is shown below.

In [123]:
fig = plt.figure(figsize=(12,5))
sns.violinplot(y=pd.concat([fg_ana["kick_distance"],fg_ana["kick_distance"]], ignore_index=True),
               x=pd.concat([pd.Series(len(fg_ana["field_goal_result"])*["total"]),fg_ana["field_goal_result"]], ignore_index=True), 
               data=fg_ana);
plt.xlabel("Type of Field Goal")
plt.ylabel("Kick Distance")
plt.title("Violinplot of Kick Distance vs Field Goals Scored")
plt.show()

From this we can see that not only is the chance of making a field goal the highest at about 30 yards away from the goalpost. Both missing and blocking is substantially higher at around 50 yards from the goalpost, where the maximum number of field goal attempts are made. Thus, the best distance from the goalpost to try to score a field goal is slightly lower than the overall median, which is around 39 yards, concluding the overall field goal strategy.

Out of passing, punting and running, punting is the best method to gain yards as shown below.

In [124]:
yds_ana1 = yds_ana[yds_ana.play_type.isin(["run","pass","punt"])]
fig = plt.figure(figsize=(15,15))
sns.violinplot(y=yds_ana1.yards_gained,
               x=yds_ana1.play_type, 
               data=yds_ana1);
plt.xlabel("Play Type")
plt.ylabel("Yards Gained")
plt.title("Violinplot of Type of Play and Yards Gained")
plt.show()                      
In [125]:
fig = px.pie(values=pass_dist,
    names=["Incomplete Passes", "Zero Yards After Catch Passes", "Other Passes"],
    color_discrete_sequence=px.colors.sequential.Turbo)
fig.show()

Question 3. Do external factors such as weather and time of year make a significant difference in performance?¶

The only environmental factor that affects any team performance indicator is wind speed, and it only affects the total score. As the wind speed increases from 0 to 21 miles per hour, the total score consistently decreases.

In [126]:
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_mph"],  data=weather)
plt.xlabel("Wind Speed in Miles Per Hour")
plt.ylabel("Total Score per Game")
plt.title("Wind Speed vs Total Score per Game")
plt.show()

Question 4. Are there any significant trends in players, for example, who has played as which position the most?¶

In [127]:
rosters[rosters.gsis_id == rosters.groupby(["gsis_id"]).team.nunique().idxmax()]
Out[127]:
season season_type full_player_name abbr_player_name team position gsis_id
3171 2012 pre Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
3839 2013 pre Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
4512 2014 pre Terrelle Pryor T.Pryor Seattle Seahawks Quarterback 00-0028825
5804 2016 pre Terrelle Pryor T.Pryor Cleveland Browns Wide Receiver 00-0028825
6482 2017 pre Terrelle Pryor T.Pryor Washington Commanders Wide Receiver 00-0028825
7228 2018 pre Terrelle Pryor T.Pryor New York Jets Wide Receiver 00-0028825
7952 2019 pre Terrelle Pryor T.Pryor Jacksonville Jaguars Wide Receiver 00-0028825
10049 2012 reg Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
10564 2013 reg Terrelle Pryor T.Pryor Las Vegas Raiders Quarterback 00-0028825
11609 2015 reg Terrelle Pryor T.Pryor Cleveland Browns Wide Receiver 00-0028825
12152 2016 reg Terrelle Pryor T.Pryor Cleveland Browns Wide Receiver 00-0028825
12684 2017 reg Terrelle Pryor T.Pryor Washington Commanders Wide Receiver 00-0028825
13229 2018 reg Terrelle Pryor T.Pryor Buffalo Bills Wide Receiver 00-0028825

The maximum number of teams someone has been in is 7, and that person is Terelle Pryor.

A distribution of how many teams each player has played in is shown below. Surprisingly, almost 40% of players have been in more than one team.

In [128]:
dist_num_teams = rosters.groupby(["gsis_id"]).team.nunique().reset_index(drop=True).value_counts()
fig = px.pie(values=dist_num_teams,
    names=dist_num_teams.index,
    color_discrete_sequence=px.colors.sequential.Aggrnyl)
fig.show()

The maximum number of positions someone has been in is 2:

In [129]:
rosters.groupby("gsis_id").position.nunique().max()
Out[129]:
2

A total of 51 players have played in 2 positions, as shown below.

In [130]:
max_positions.full_player_name.unique().size
Out[130]:
51
In [131]:
position_df = rosters.groupby("gsis_id").position.nunique().to_frame()
position_df = position_df[position_df.position == position_df.position.max()].rename(columns={"position":"pos_count"})
max_positions = pd.merge(rosters[["full_player_name","gsis_id","position"]],
                 position_df,
                 on='gsis_id',
                 how='inner')
pos_pairs = max_positions.groupby("full_player_name").position.unique().apply(set).value_counts()
with np.printoptions(threshold=np.inf):
    print(max_positions.full_player_name.unique())
['Danny Woodhead' 'Dexter McCluster' 'Jason Snelling' 'Michael Robinson'
 'James Casey' 'Lousaka Polite' 'Jacob Hester' 'Joe Webb' 'Mike Tolbert'
 'Robert Hughes' 'Jamize Olawale' 'Ty Montgomery' 'Zach Zenner' 'C.J. Ham'
 'Logan Thomas' 'Zach Line' 'Brock Bolen' 'Dorin Dickerson'
 'Earnest Graham' 'Ernest Wilford' 'Greg Jones' 'Jamie McCoy'
 'Jason Pociask' 'John Conner' 'Lex Hilliard' 'Manase Tonga'
 'Montell Owens' 'Peyton Hillis' 'Tim Castille' 'Frank Summers'
 'Mike Sellers' 'Niles Paul' 'Stanley Havili' 'Emil Igwenagu'
 'Terrelle Pryor' 'Gerell Robinson' 'Ifeanyi Momah' 'Jason Schepler'
 'Jeremy Stewart' 'Patrick DiMarco' 'Anthony Sherman' "De'Anthony Thomas"
 'Marcus Lucas' 'Daniel Brown' 'Darren Waller' 'Neal Sterling'
 'Byron Marshall' 'Moritz Boehringer' 'Shane Smith' 'Vince Mayle'
 'Bug Howard']
In [132]:
max_positions.full_player_name.unique().size
Out[132]:
51
In [133]:
pos_pairs.index = pos_pairs.index.astype("str")
In [134]:
fig = px.pie(values=pos_pairs,
    names=pos_pairs.index,
    color_discrete_sequence=px.colors.sequential.Jet)
fig.show()

In conclusion, most players who double their roles do so as running backs and full backs with 47.1% of them doing so, followed by wide receivers and tight ends at 23.5% and 9.8% respectively.

Most players also play for two years or less, as shown below.

In [135]:
sns.violinplot(players.year_end-players.year_start)
plt.xlabel("Years Played")
plt.title("Violinplot of Years Played per Player")
Out[135]:
Text(0.5, 1.0, 'Violinplot of Years Played per Player')

The median number of years played is 2, and the maximum is 21. A pieplot is shown below for the same statistic:

In [136]:
fig = px.pie(values=(players.year_end-players.year_start).sort_index().value_counts(),
    names=(players.year_end-players.year_start).value_counts().sort_index().index)
fig.show()
In [137]:
sns.boxplot(x=players.year_start, y=players.weight)
plt.xlabel("Year Started Playing")
plt.ylabel("Weight in Pounds")
plt.title("Boxplot of Players' Weight over Time")
Out[137]:
Text(0.5, 1.0, "Boxplot of Players' Weight over Time")
In [138]:
sns.boxplot(x=players.year_end-players.year_start, y=players.weight)
plt.xlabel("Years Played")
plt.ylabel("Weight in Pounds")
plt.title("Boxplot of Players' Weight vs their Playing Experience")
Out[138]:
Text(0.5, 1.0, "Boxplot of Players' Weight vs their Playing Experience")

The weight of the players stays about constant regardless of when they started playing or how long they played for.

Recommendations or Further Works¶

State any recommendations, improvements or further works.

In conclusion, we have explored and discussed the relationships present between various variables involved in American football as well as possible reasons behind the trends for each. While some strong trends have been found, there are also other factors present, as well as variables which have been found to be purely random.

Limitations¶

Many factors may not be correlations, due to factors like player skill which is not quantifiable. Weather data, which was previously thought to be somewhat related to team performance, has been shown to barely affect it at all. Other factors such as the composition of the teams changing over time was also not fully taken into account when finding the trend for team performance specifically. It is also unfortunate that the timeframes of the data did not fit exactly with each other, rendering it unable to interconnect the datasets substantially to find more potentially underlying trends within the data.

Improvements¶

About 230 of the columns from one of the datasets had to be dropped out of almost 260. This shows that the data is far from perfect, however, some of these columns may have had some data with correlation that was removed for a variety of reasons, for example, the fumble column of the play by play dataset, although it may contribute to the score, such actions are uncontrollable by the player in an actual game, so including it, although it may lead to a more substantial conclusion, it would just add a larger luck factor to the final trend.

References¶

Cite any references made, and links where you obtained the data. You may wish to read about how to use markdown in Jupyter notebook to make your report easier to read. https://www.ibm.com/docs/en/db2-event-store/2.0.0?topic=notebooks-markdown-jupyter-cheatsheet

All referenced documents have been included with the submission of proposal

  1. https://wordpress.com/support/markdown-quick-reference/ (you may refer to this link on markup for Jupyter when formatting your proposal)

  2. https://en.wikipedia.org/wiki/Sports_in_the_United_States#:~:text=More%20recently%20over%20the%20past,turn%20of%20the%2021st%20century.